Naive Bayes

1 Background

Given a class variable $y$ and a dependent feature vector $x_1$ through $x_n$, Bayes' theorem states the following relationship.
$$P(y|x_1,\ldots,x_n)=\frac{P(y)P(x_1,\ldots,x_n | y)}{P(x_1,\ldots,x_n)}$$

Using the naive independence assumption that

$$P(x_i|y,x_1,\ldots,x_{i-1},x_{i+1},\ldots,x_n)=P(x_i|y)$$

for all $i$, this relationship simplifies to

$$P(y|x_1,\ldots,x_n)=\frac{P(y)\prod_{i=1}^{n}P(x_i|y)}{P(x_1,\ldots,x_n)}$$

Since $P(x_1,\ldots,x_n)$ is constant given the input, we can use the following classification rule:

$$P(y|x_1,\ldots,x_n) \propto P(y)\prod_{i=1}^{n}P(x_i|y)$$

$$\Rightarrow \hat{y}=\arg\max_{y} P(y)\prod_{i=1}^{n}P(x_i|y)$$
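The classification rule can be sketched in a few lines of NumPy. The priors and per-feature likelihoods below are made-up numbers chosen only for illustration (they do not come from any dataset in this notebook), and the sum of logarithms is used in place of the product to avoid numerical underflow when there are many features.

```python
import numpy as np

# Hypothetical values, for illustration only:
# priors[c]         = P(y = c)
# likelihoods[c, i] = P(x_i | y = c) evaluated at the observed sample
priors = np.array([0.6, 0.4])
likelihoods = np.array([[0.5, 0.2, 0.9],
                        [0.3, 0.7, 0.8]])

# log P(y) + sum_i log P(x_i | y), one score per class
log_scores = np.log(priors) + np.log(likelihoods).sum(axis=1)

# The predicted class maximizes the (log) score
y_hat = int(np.argmax(log_scores))

# Dividing by the sum of the unnormalized scores recovers P(y | x_1, ..., x_n)
scores = np.exp(log_scores)
posterior = scores / scores.sum()

print(y_hat, posterior)
```

Because the logarithm is monotonic, taking the argmax of the log scores selects the same class as taking the argmax of the original product.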

2 Gaussian Naive Bayes

The likelihood of each feature is assumed to be Gaussian:

$$P(x_i|y)=\frac{1}{\sqrt{2\pi\sigma_{y}^2}}\exp\left( -\frac{(x_i-\mu_y)^2}{2\sigma_y^2}\right)$$
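As a rough illustration of the formula, the sketch below estimates $\mu_y$ and $\sigma_y^2$ for a single feature from a handful of made-up training values and evaluates the Gaussian likelihood of a new value under each class. The helper name `gaussian_likelihood` and all the numbers are hypothetical, and the variance is the maximum-likelihood estimate (the default of `np.var`).

```python
import numpy as np

def gaussian_likelihood(x, mu, var):
    """Evaluate the Gaussian density P(x_i | y) with mean mu and variance var."""
    return np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

# Hypothetical 1-D training values of one feature, grouped by class
samples = {
    0: np.array([4.9, 5.0, 5.1, 5.2]),
    1: np.array([6.3, 6.5, 6.6, 6.8]),
}

# Per-class mean and maximum-likelihood variance
params = {label: (values.mean(), values.var()) for label, values in samples.items()}

# Likelihood of a new observation under each class
x_new = 5.3
likelihoods = {label: gaussian_likelihood(x_new, mu, var)
               for label, (mu, var) in params.items()}
print(likelihoods)
```

With several features, the per-feature likelihoods would be multiplied together (or their logarithms summed) and combined with the class prior as in the classification rule from the Background section.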

In [2]:
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

# Fit on the full iris dataset and predict on the same data (training error)
iris = datasets.load_iris()
gnb = GaussianNB()
y_pred = gnb.fit(iris.data, iris.target).predict(iris.data)

In [4]:
print('Number of mislabeled points out of a total %d points: %d'
      % (iris.data.shape[0], (iris.target != y_pred).sum()))


Number of mislabeled points out of a total 150 points: 6

3 Multinomial Naive Bayes

MultinomialNB implements the naive Bayes algorithm for multinomially distributed data, and is one of the two classic naive Bayes variants used in text classification.


In [24]:
import numpy as np
x_1=np.array([1,1,1,1,1,
              2,2,2,2,2,
              3,3,3,3,3])
x_2=np.array(['S','M','M','S','S',
              'S','M','M','L','L',
              'L','M','M','L','L'])
X=np.vstack((x_1,x_2)).T
y=np.array([-1,-1,1,1,-1,
            -1,-1,1,1,1,
            1, 1, 1,1,-1])
from sklearn.naive_bayes import MultinomialNB
clf=MultinomialNB()
clf.fit(X,y)


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-24-bad21f44acb2> in <module>()
     12 from sklearn.naive_bayes import MultinomialNB
     13 clf=MultinomialNB()
---> 14 clf.fit(X,y)

/Users/gaufung/anaconda/lib/python3.6/site-packages/sklearn/naive_bayes.py in fit(self, X, y, sample_weight)
    585         self.feature_count_ = np.zeros((n_effective_classes, n_features),
    586                                        dtype=np.float64)
--> 587         self._count(X, Y)
    588         self._update_feature_log_prob()
    589         self._update_class_log_prior(class_prior=class_prior)

/Users/gaufung/anaconda/lib/python3.6/site-packages/sklearn/naive_bayes.py in _count(self, X, Y)
    687     def _count(self, X, Y):
    688         """Count and smooth feature occurrences."""
--> 689         if np.any((X.data if issparse(X) else X) < 0):
    690             raise ValueError("Input X must be non-negative")
    691         self.feature_count_ += safe_sparse_dot(Y.T, X)

TypeError: '<' not supported between instances of 'numpy.ndarray' and 'int'
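
The `TypeError` above arises because stacking the integer column `x_1` with the string column `x_2` produces an array of strings, while `MultinomialNB` expects non-negative numeric features such as counts. One possible workaround, sketched below, is to one-hot encode both categorical columns before fitting. This assumes scikit-learn 0.20 or later (where `OneHotEncoder` accepts string categories); the query point `('2', 'S')` is a hypothetical example and not part of the original cell.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import OneHotEncoder

x_1 = np.array([1, 1, 1, 1, 1,
                2, 2, 2, 2, 2,
                3, 3, 3, 3, 3])
x_2 = np.array(['S', 'M', 'M', 'S', 'S',
                'S', 'M', 'M', 'L', 'L',
                'L', 'M', 'M', 'L', 'L'])
y = np.array([-1, -1, 1, 1, -1,
              -1, -1, 1, 1, 1,
              1, 1, 1, 1, -1])

# Treat both columns as categorical: convert x_1 to strings explicitly
X = np.column_stack((x_1.astype(str), x_2))

# One-hot encoding turns every category into a 0/1 indicator column,
# which satisfies MultinomialNB's non-negative input requirement
encoder = OneHotEncoder()
X_encoded = encoder.fit_transform(X)

clf = MultinomialNB()
clf.fit(X_encoded, y)

# Hypothetical query point: x_1 = 2, x_2 = 'S'
query = np.array([['2', 'S']])
pred = clf.predict(encoder.transform(query))
print(pred)
```

Newer scikit-learn releases also provide `CategoricalNB`, which models categorical features directly and may be a closer fit for this kind of data.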

4 Bernoulli Naive Bayes

BernoulliNB implements the naive Bayes training and classification algorithms for data distributed according to multivariate Bernoulli distributions, i.e. there may be multiple features, but each one is assumed to be a binary-valued variable.
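
A minimal usage sketch of `BernoulliNB` follows. The binary feature matrix and labels are made up purely for illustration: the first feature is mostly active in class 0, the other two mostly in class 1.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical binary data: 6 samples, 3 boolean features
X_bin = np.array([[1, 0, 0],
                  [1, 0, 1],
                  [1, 1, 0],
                  [0, 1, 1],
                  [0, 0, 1],
                  [0, 1, 1]])
y_bin = np.array([0, 0, 0, 1, 1, 1])

# alpha=1.0 is Laplace smoothing (also the scikit-learn default)
bnb = BernoulliNB(alpha=1.0)
bnb.fit(X_bin, y_bin)

# Classify two new binary vectors
X_new = np.array([[1, 0, 0],
                  [0, 1, 1]])
pred_bin = bnb.predict(X_new)
print(pred_bin)
```

Unlike the multinomial variant, the Bernoulli model explicitly penalizes the absence of a feature that is indicative of a class, so zeros in the input carry information as well.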

